R at scale on the Google Cloud Platform

Mark Edmondson (@HoloMarkeD)

May 20th, 2019 - CopenhagenR

code.markedmondson.me

Credentials

My R Timeline

  • Digital agencies since 2007
  • useR since 2012 - Motive: how to use all this web data?
  • Shiny enthusiast e.g. https://gallery.shinyapps.io/ga-effect/
  • Google Developer Expert - Google Analytics & Google Cloud
  • Several Google API themed packages on CRAN via googleAuthR
  • Part of cloudyr group (AWS/Azure/GCP R packages for the cloud) https://cloudyr.github.io/
  • Now: Data Engineer @ IIH Nordic

GA Effect

googleAuthRverse

  • searchConsoleR
  • googleAuthR
  • googleAnalyticsR
  • googleComputeEngineR (Cloudyr)
  • bigQueryR (Cloudyr)
  • googleCloudStorageR (Cloudyr)
  • googleLanguageR (rOpenSci)

Slack group to talk about the packages: #googleAuthRverse

Scale (almost) always starts with Docker containers

Dockerfiles from The Rocker Project

https://www.rocker-project.org/

Maintain useful R images

  • rocker/r-ver
  • rocker/rstudio
  • rocker/tidyverse
  • rocker/shiny
  • rocker/ml-gpu

Thanks to Rocker Team

Dockerfiles

FROM rocker/tidyverse:3.6.0
LABEL maintainer="Mark Edmondson (r@sunholo.com)"

# install system dependencies needed by the R packages
RUN apt-get update && apt-get install -y \
    libssl-dev

## Install packages from CRAN, then GitHub, then clean up
RUN install2.r --error \
    -r 'http://cran.rstudio.com' \
    googleAuthR \
    googleComputeEngineR \
    googleAnalyticsR \
    searchConsoleR \
    googleCloudStorageR \
    bigQueryR \
    && installGithub.r MarkEdmondson1234/youtubeAnalyticsR \
    && rm -rf /tmp/downloaded_packages/ /tmp/*.rds

Docker + R = R in Production

  • Flexible: no need to ask IT to install R everywhere, just use docker run. Cross-cloud, ascendant tech.

  • Version controlled: no worries that new package releases will break your code.

  • Scalable: run multiple Docker containers at once; fits into an event-driven, stateless, serverless future.

Creating Docker images with Cloud Build

Continuous development with GitHub pushes

Scaling R scripts, Shiny apps and APIs

Strategies to scale R

  • Vertical scaling - increase the size and power of one machine
  • Horizontal scaling - split up your problem into lots of little machines
  • Serverless scaling - send your code + data into the cloud and let the provider work out how many machines are needed

Vertical scaling

Bigger boat

Bigger VMs

Good for one-off workloads

Pros

  • Probably run the same code with no changes needed
  • Easy to set up

Cons

  • Expensive
  • May be better to have the data in a database

Launching a monster VM in the cloud

3.75TB of RAM: $423 a day
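
As a rough sketch of what such a launch could look like from R with googleComputeEngineR: the machine type n1-ultramem-160 (roughly 3.75TB of RAM), the VM name and the login details are assumptions for illustration, not taken from the talk. The launch is wrapped in a function so nothing is started until you call it.

```r
# Sketch only: assumes googleComputeEngineR is installed and authenticated
# (for example via GCE_AUTH_FILE, GCE_DEFAULT_PROJECT_ID, GCE_DEFAULT_ZONE).
launch_big_vm <- function(name = "big-mem",                    # hypothetical VM name
                          machine_type = "n1-ultramem-160") {  # assumed machine type
  googleComputeEngineR::gce_vm(
    name,
    template = "rstudio",       # RStudio Server template
    username = "me",            # placeholder login
    password = "change-me",     # placeholder, choose a real one
    predefined_type = machine_type
  )
}

# The VM is charged for uptime, so cost scales with the hours it runs.
# $423 a day is the figure quoted on the slide.
cost_for_hours <- function(hours, usd_per_day = 423) {
  usd_per_day / 24 * hours
}

cost_for_hours(3)  # cost in USD of a three-hour session
```

Stopping the VM when the work is done (for example with gce_vm_stop()) is what keeps a one-off job like this affordable.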

RStudio Server

Standard VM serving Shiny

Cloud computing considerations

  • Only charged for uptime, can configure lots of VMs so…
  • Have lots of specialised VMs (Docker images) not one big workstation
  • Keep code and data separate e.g. googleCloudStorageR or bigQueryR
  • Think of VMs as functions of computing power
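
One way to picture "code separate from data" is a job function that pulls its input from Cloud Storage, runs a pure computation, and pushes the result back. This is a sketch under assumptions: the bucket, object names and the sales data are all hypothetical, and the cloud calls sit inside a function that is not run here.

```r
# The computation itself knows nothing about where the data lives
summarise_sales <- function(sales) {
  aggregate(amount ~ region, data = sales, FUN = sum)
}

# Sketch only: assumes googleCloudStorageR is installed and authenticated,
# and that the (hypothetical) bucket and objects exist.
run_job <- function(bucket = "my-bucket",
                    input  = "input/sales.csv",
                    output = "output/summary.csv") {
  # pull the data in; downloaded objects are parsed where possible
  sales <- googleCloudStorageR::gcs_get_object(input, bucket = bucket)
  result <- summarise_sales(sales)
  # push the result back; data.frames are written out as CSV by default
  googleCloudStorageR::gcs_upload(result, bucket = bucket, name = output)
}

# local check of the pure part, no cloud needed
summary_out <- summarise_sales(
  data.frame(region = c("north", "south", "north"), amount = c(10, 5, 2))
)
summary_out
```

Because the VM holds no state of its own, it can be deleted as soon as run_job() finishes.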

Horizontal scaling

Lots of little machines can accomplish great things

Parallelise your code

Good for parallelisable data tasks

Pros

  • Fault redundancy
  • Forces repeatable/reproducible infrastructure
  • library(future) makes parallel processing very usable

Cons

  • Changes to your code for split-map-reduce
  • Write meta code to handle I/O data and code
  • Not applicable to some problems

Adopt a split-map-reduce mindset

  • Break problems down into stateless lumps
  • Reusable bricks that can be applied to other tasks
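
The mindset above can be sketched in base R alone. The data and the summarise_shop() helper are made up for illustration; the point is only the shape: split, apply a stateless function, combine.

```r
# split: break the data into independent lumps, one per shop
sales <- data.frame(
  shop  = rep(c("a", "b", "c"), each = 4),
  week  = rep(1:4, times = 3),
  units = c(10, 12, 11, 13, 20, 19, 22, 21, 5, 7, 6, 8)
)
lumps <- split(sales, sales$shop)

# map: a stateless function that depends only on its input
summarise_shop <- function(df) {
  data.frame(shop = df$shop[1], mean_units = mean(df$units))
}
results <- lapply(lumps, summarise_shop)

# reduce: combine the pieces back into one result
combined <- do.call(rbind, results)
combined
```

Since the map step is a plain lapply() over independent pieces, it is the part that can later be handed to parallel workers without touching the split or reduce steps.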

Setup a cluster

New in googleComputeEngineR v0.3 - a shortcut that launches the cluster and checks authentication for you
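
A minimal sketch of that shortcut, assuming it refers to gce_vm_cluster(). The name prefix and the n1-standard-8 machine type (8 CPUs per VM) are illustrative choices, and the call is wrapped in a function so no VMs are created until you run it.

```r
# Sketch only: assumes googleComputeEngineR >= 0.3 is installed and authenticated.
launch_cluster <- function(n = 3) {
  googleComputeEngineR::gce_vm_cluster(
    vm_prefix = "r-cluster-",          # hypothetical name prefix
    cluster_size = n,                  # number of VMs to launch
    predefined_type = "n1-standard-8"  # assumed machine type, passed on to gce_vm()
  )
}

# vms <- launch_cluster(3)   # uncomment to actually create the VMs
```

Remember to stop or delete the VMs afterwards (gce_vm_stop() or gce_vm_delete()), since they are charged for uptime.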

library(future)

googleComputeEngineR has a custom method for future::as.cluster
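
To show the shape of the code, here is a sketch that runs locally on two background R sessions. It assumes the future and future.apply packages are installed; slow_square() is a made-up stand-in for real work. On a cluster of VMs, only the plan() line would change.

```r
library(future)
library(future.apply)  # provides future_lapply()

# Locally: background R sessions. With VMs from googleComputeEngineR the plan
# would instead be something like
#   plan(cluster, workers = as.cluster(vms))
plan(multisession, workers = 2)

slow_square <- function(x) {
  Sys.sleep(0.1)  # stand-in for real work
  x^2
}

# same interface as lapply(), but each element may run on a different worker
squares <- future_lapply(1:4, slow_square)
unlist(squares)

plan(sequential)  # shut the background workers down again
```

The appeal of this design is that the loop body stays the same whether the workers are local sessions or remote VMs.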

Forecasting example

Multi-layer future loops

Future loops can be nested to use each CPU within each VM

Thanks to Grant McDermott for figuring out the optimal method (Issue #129)
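
A generic sketch of a two-layer future plan, not necessarily the exact method from the issue. It assumes the future package is installed, that vms is a cluster created with googleComputeEngineR, and that each VM has 8 CPUs; plan_nested() is a hypothetical helper name.

```r
library(future)

# Sketch only: `vms` is assumed to come from gce_vm_cluster(), with
# googleComputeEngineR loaded so its as.cluster() method is found.
plan_nested <- function(vms, cores_per_vm = 8) {
  plan(list(
    # outer layer: one worker per VM
    tweak(cluster, workers = as.cluster(vms)),
    # inner layer: one worker per CPU inside each VM
    tweak(multisession, workers = cores_per_vm)
  ))
}

# plan_nested(vms)   # uncomment once a cluster is running
```

With this plan an outer future loop is spread across the VMs, and any future loop inside it is spread across the CPUs of the VM it landed on.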

CPU utilization

3 VMs, 8 CPUs each = 24 threads

Serverless scaling

We spoke previously of

Clusters of VMs + Docker = Horizontal scaling

Kubernetes

Clusters of VMs + Docker + Task controller = Kubernetes

Kubernetes

Good for Shiny / R APIs

Pros

  • Auto-scaling, task queues etc.
  • Scale to billions
  • Potentially cheaper
  • May already have cluster in your organisation

Cons

  • Needs stateless, idempotent workflows
  • Message broker?
  • Minimum 3 VMs

Dockerfiles for Shiny apps

Built by Cloud Build on every GitHub push:

FROM rocker/shiny
LABEL maintainer="Mark Edmondson (r@sunholo.com)"

# install R package dependencies
RUN apt-get update && apt-get install -y \
    libssl-dev
    
## Install packages from CRAN needed for your app
RUN install2.r --error \ 
    -r 'http://cran.rstudio.com' \
    googleAuthR \
    googleAnalyticsR

## assume shiny app is in build folder /shiny
COPY ./shiny/ /srv/shiny-server/myapp/

Kubernetes deployments - Shiny

Shiny App:

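# on current kubectl, 'kubectl run' only creates a Pod, so create the Deployment explicitly
kubectl create deployment shiny1 \
  --image=gcr.io/gcer-public/shiny-googleauthrdemo:latest \
  --port=3838

kubectl expose deployment shiny1 \
  --target-port=3838  --type=NodePort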

Dockerfiles for plumber APIs

Built by Cloud Build on every GitHub push:

FROM trestletech/plumber

# copy your plumbed R script     
COPY api.R /api.R

# default is to run the plumbed script
CMD ["api.R"]
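
For context, a plumbed api.R could look like the sketch below. The /echo endpoint, the message format and the echo_message() helper are assumptions for illustration; the plumber annotations are ordinary R comments, so the file also sources cleanly in plain R.

```r
# api.R - sketch of a plumbed script (endpoint and message format assumed)

# plain R helper, so the logic can be checked without a running server
echo_message <- function(msg = "") {
  paste0("The message is: ", msg)
}

#* Echo back the input
#* @param msg The message to echo
#* @get /echo
function(msg = "") {
  echo_message(msg)
}
```

Keeping the logic in a named helper and the endpoint as a thin wrapper makes the API easy to unit test before it goes into a container.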

Kubernetes deployments - Plumber

R plumber API:

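# on current kubectl, 'kubectl run' only creates a Pod, so create the Deployment explicitly
kubectl create deployment my-plumber \
  --image=gcr.io/your-project/my-plumber \
  --port=8000

kubectl expose deployment my-plumber \
  --target-port=8000  --type=NodePort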

Shiny apps waiting for service

Expose your workloads via Ingress

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: r-ingress-nginx
spec:
  rules:
  - http:
      paths:
      # app deployed to /gar/shiny/
      - path: /gar/
        pathType: Prefix
        backend:
          service:
            name: shiny1
            port:
              number: 3838

Apps available at URL on demand

curl -G 'http://mydomain.com/api/echo' --data-urlencode 'msg=its alive!'
#> "The message is: its alive!"

I thought I knew a bit about R and Google Cloud but then…

GoogleNext19 - Data Science at Scale with R on GCP

A 40-minute talk at Google Next19 with lots of new things to try!

https://www.youtube.com/watch?v=XpNVixSN-Mg&feature=youtu.be

New concepts

A great video that goes deeper into Spark clusters, Jupyter notebooks, training with ML Engine and scaling with Seldon on Kubernetes, all things I haven’t tried yet

Some shots from the video

Google Cloud Platform - Serverless Pyramid

Google Cloud Platform - R applications

Conclusions

Take-aways

  • Anything scales on Google Cloud Platform, including R
  • Docker docker docker
  • library(future)
  • Pick the scaling strategy most suitable for you

Gratitude

  • Thank you for listening
  • Thanks to Kenneth for inviting me
  • Thanks to RStudio for all their cool things. Support them by buying their stuff.
  • Thanks again to Rocker
  • Thanks to Google for the Developer Expert programme and for building cool stuff.

Say hello afterwards